GVPT Maths Boot Camp

Exploratory Data Analysis

Learning objectives

  1. Learn how to generate questions about your data

  2. Learn how to discern interesting relations in your data

  3. Use your new data science tools to better understand your data

Two basic questions to guide your EDA

  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?

Examining gapminder

library(gapminder)
library(dplyr)

head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

Variation: Factors

How many countries are there in our data set?

gapminder |> 
  distinct(country) |> 
  nrow()
[1] 142

How many continents?

gapminder |> 
  distinct(continent) |> 
  nrow()
[1] 5

EXERCISE: How many countries in each continent?

Variation: Numeric

What is the earliest and latest year we cover?

summarise(gapminder, min(year), max(year))
# A tibble: 1 × 2
  `min(year)` `max(year)`
        <int>       <int>
1        1952        2007

What about our other numeric variables?

# reframe() (dplyr >= 1.1.0) allows results with more than one row
reframe(gapminder, across(lifeExp:gdpPercap, ~ quantile(.x)))
# A tibble: 5 × 3
  lifeExp         pop gdpPercap
    <dbl>       <dbl>     <dbl>
1    23.6      60011       241.
2    48.2    2793664      1202.
3    60.7    7023596.     3532.
4    70.8   19585222.     9325.
5    82.6 1318683096    113523.

The Five Number Summary

The five number summary is a useful way to summarise numeric data. It consists of the:

  • Minimum

  • 25th percentile

  • 50th percentile (the median)

  • 75th percentile

  • Maximum
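
As a small illustration, base R's quantile() returns these five values by default. The vector x below is made up for the example (it is not taken from gapminder). It includes one extreme value to show that the 50th percentile is the median, which can sit far from the mean.

```r
# Toy data (hypothetical values, not from gapminder)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# With no probs argument, quantile() returns the minimum,
# the 25th, 50th and 75th percentiles, and the maximum
five_num <- quantile(x)
five_num

# The 50th percentile is the median; the extreme value 100 does not move it
median(x)

# The mean is pulled upwards by that single large value
mean(x)
```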

Visualising the Five Number Summary

library(ggplot2)

ggplot(gapminder, aes(y = lifeExp)) + 
  geom_boxplot() + 
  theme_minimal()

Visualising the IQR for groups

library(ggplot2)

ggplot(gapminder, aes(x = continent, y = lifeExp)) + 
  geom_boxplot() + 
  theme_minimal()
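
The boxes in this plot can also be read off as numbers. The following is a minimal sketch, assuming you want the quartiles and interquartile range (IQR) of lifeExp for each continent; the object name life_exp_spread and its column names are chosen here for illustration.

```r
library(gapminder)
library(dplyr)

# Quartiles and IQR of life expectancy within each continent
life_exp_spread <- gapminder |>
  group_by(continent) |>
  summarise(
    q1     = quantile(lifeExp, 0.25),  # bottom edge of the box
    median = median(lifeExp),          # line inside the box
    q3     = quantile(lifeExp, 0.75),  # top edge of the box
    iqr    = IQR(lifeExp)              # height of the box (q3 - q1)
  )

life_exp_spread
```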

Visualising the distribution of numeric variables

ggplot(gapminder, aes(x = lifeExp)) + 
  geom_histogram() + 
  theme_minimal()

Visualising the distribution of numeric variables: density

ggplot(gapminder, aes(x = lifeExp)) + 
  geom_density() + 
  theme_minimal()

Visualising the distribution of numeric variables: density by group

ggplot(gapminder, aes(x = lifeExp, fill = continent)) + 
  geom_density(alpha = 0.5) + 
  theme_minimal()

Visualising counts

gapminder |>
  distinct(continent, country) |> 
  count(continent) |> 
  ggplot(aes(x = n, y = reorder(continent, n))) + 
  geom_col() + 
  theme_minimal()

Identifying unusual values

ggplot(gapminder, aes(x = gdpPercap)) + 
  geom_histogram() + 
  theme_minimal()

Identifying unusual values

ggplot(gapminder, aes(x = gdpPercap)) + 
  geom_boxplot() + 
  theme_minimal()
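
The points beyond the whiskers can also be pulled out as rows. By default geom_boxplot() draws whiskers out to 1.5 times the IQR from the box, so one possible sketch is to flag observations above that upper fence. The names upper_fence and high_gdp are made up for this example, and the 1.5 multiplier is a convention rather than a strict definition of "unusual".

```r
library(gapminder)
library(dplyr)

# Upper fence under the usual boxplot convention: Q3 + 1.5 * IQR
upper_fence <- unname(quantile(gapminder$gdpPercap, 0.75)) +
  1.5 * IQR(gapminder$gdpPercap)

# Observations above the fence, largest GDP per capita first
high_gdp <- gapminder |>
  filter(gdpPercap > upper_fence) |>
  arrange(desc(gdpPercap))

head(high_gdp)
```

Looking at which countries and years appear in high_gdp is a natural next question for your EDA.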

Identifying relationships in your data

Does one variable tend to move in the same direction as another?

ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) + 
  geom_point() + 
  theme_minimal()
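
One way to put a number on this pattern is the correlation coefficient. The sketch below assumes Pearson's correlation (the default in cor()) on the same logged GDP variable used in the plot; the name gdp_life_cor is chosen for illustration.

```r
library(gapminder)

# Pearson correlation between log GDP per capita and life expectancy.
# Values near 1 mean the two variables tend to rise together;
# values near -1 mean one tends to fall as the other rises.
gdp_life_cor <- cor(log(gapminder$gdpPercap), gapminder$lifeExp)
gdp_life_cor
```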

A preview of linear regression
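
A minimal sketch of what such a preview might look like, assuming a straight-line fit of lifeExp on log(gdpPercap). The model choice and the object names fit and p are assumptions made for illustration.

```r
library(gapminder)
library(ggplot2)

# Fit a straight line: life expectancy as a function of log GDP per capita
fit <- lm(lifeExp ~ log(gdpPercap), data = gapminder)

# Intercept and slope of the fitted line
coef(fit)

# Draw the same kind of fitted line over the scatter plot
p <- ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal()

p
```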

There has to be an easier way!

A quick look with glimpse():

glimpse(gapminder)

A quick summary with skim():

# Run once if skimr is not already installed
# install.packages("skimr")

skimr::skim(gapminder)

Summary

Today you:

  1. Learnt how to explore and visualise interesting relations in your data

  2. Used your new data science tools to better understand your data